[BlockSTM] Add latency counter for profiling BlockSTM #6956

sitalkedia · 2023-03-06T22:24:53Z

Description

Add more instrumentation to Block-STM to help identify performance bottlenecks. Also fix APTOS_EXECUTOR_EXECUTE_BLOCK_SECONDS so that it measures the latency of the entire function call.
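To illustrate what "measure the latency of the entire function call" can look like, here is a minimal std-only sketch. It is not the code from this PR: `LatencyRecorder`, `TimerGuard`, and `execute_block` are hypothetical stand-ins, and the recorder just stores raw samples instead of feeding a real histogram. The idea is that a guard bound at the top of the function records the elapsed time when it is dropped, so the sample covers every phase of the call.

```rust
use std::cell::RefCell;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a latency histogram: it only stores raw samples.
#[derive(Default)]
struct LatencyRecorder {
    samples: RefCell<Vec<Duration>>,
}

impl LatencyRecorder {
    /// Starts a timer; the elapsed time is recorded when the guard is dropped.
    fn start_timer(&self) -> TimerGuard<'_> {
        TimerGuard { recorder: self, start: Instant::now() }
    }

    /// Number of latency samples recorded so far.
    fn sample_count(&self) -> usize {
        self.samples.borrow().len()
    }
}

struct TimerGuard<'a> {
    recorder: &'a LatencyRecorder,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        self.recorder.samples.borrow_mut().push(self.start.elapsed());
    }
}

/// Hypothetical block-execution function with three toy phases.
fn execute_block(recorder: &LatencyRecorder, txns: &[u64]) -> u64 {
    // Bind the guard to a named variable at the very top of the function.
    // It is dropped when the function returns, so the sample spans all phases.
    // (Writing `let _ = ...` instead would drop the guard immediately.)
    let _timer = recorder.start_timer();

    let prepared: Vec<u64> = txns.iter().map(|t| t + 1).collect(); // "setup"
    let executed: u64 = prepared.iter().sum(); // "execution"
    executed * 2 // "commit"
}
```

Calling `execute_block(&recorder, &[1, 2, 3])` once leaves exactly one sample in the recorder. Starting the timer later in the function, or around only one phase, would under-report the latency of the call.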

Test Plan

Tested with Forge and executor-benchmark to ensure this doesn't introduce any regressions.

github-actions · 2023-03-06T23:10:21Z

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `a491021dab57520dd0a86ee69682b4bc4f274c5c`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> a491021dab57520dd0a86ee69682b4bc4f274c5c (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7953 TPS, 4786 ms latency, 7300 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: a491021dab57520dd0a86ee69682b4bc4f274c5c
compatibility::simple-validator-upgrade::single-validator-upgrade : 5074 TPS, 8038 ms latency, 11100 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: a491021dab57520dd0a86ee69682b4bc4f274c5c
compatibility::simple-validator-upgrade::half-validator-upgrade : 4744 TPS, 8546 ms latency, 11600 ms p99 latency,no expired txns
4. upgrading second batch to new version: a491021dab57520dd0a86ee69682b4bc4f274c5c
compatibility::simple-validator-upgrade::rest-validator-upgrade : 7144 TPS, 5392 ms latency, 8700 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> a491021dab57520dd0a86ee69682b4bc4f274c5c passed
Test Ok

github-actions · 2023-03-06T23:12:00Z

✅ Forge suite `framework_upgrade` success on `cb4ba0a57c998c60cbab65af31a64875d2588ca5` ==> `a491021dab57520dd0a86ee69682b4bc4f274c5c`

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> a491021dab57520dd0a86ee69682b4bc4f274c5c (PR)
Upgrade the nodes to version: a491021dab57520dd0a86ee69682b4bc4f274c5c
framework_upgrade::framework-upgrade::full-framework-upgrade : 6913 TPS, 5540 ms latency, 8300 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> a491021dab57520dd0a86ee69682b4bc4f274c5c passed
Test Ok

github-actions · 2023-03-06T23:14:01Z

✅ Forge suite `land_blocking` success on `a491021dab57520dd0a86ee69682b4bc4f274c5c`

performance benchmark with full nodes : 5919 TPS, 6627 ms latency, 16300 ms p99 latency,(!) expired 2780 out of 2530320 txns
Test Ok

danielxiangzl

I ran the parallel-execution-only benchmark on this PR, and there seems to be no performance degradation. Can you also run the execution benchmark to check the performance?

sitalkedia · 2023-03-07T23:49:22Z

@danielxiangzl - Yes, I ran both Forge and the execution benchmark with and without this change, and there is no performance degradation.

aptos-move/block-executor/src/counters.rs

gelash

Thanks!

gelash · 2023-03-08T12:44:41Z

aptos-move/block-executor/src/counters.rs

+pub static PARALLEL_EXECUTION_SECONDS: Lazy<Histogram> = Lazy::new(|| {
+    register_histogram!(
+        // metric name
+        "aptos_parallel_execution_seconds",


I wonder if it would help to have all these names start with "aptos_execution", in which case we could make this one aptos_execution_seconds and the next one aptos_execution_rayon_seconds or something similar?

sitalkedia · 2023-03-08

@gelash - The PR got auto-merged; I will sneak the renaming changes into another PR if that's okay.
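The naming scheme proposed in this thread can be sketched as a small helper that derives every metric name from one shared prefix. This is only an illustration: `execution_metric_name` and `EXECUTION_PREFIX` are hypothetical and are not part of the PR or of any metrics library.

```rust
/// Shared prefix suggested in the review comment.
const EXECUTION_PREFIX: &str = "aptos_execution";

/// Builds a metric name of the form `aptos_execution[_<stage>]_seconds`.
/// Hypothetical helper, used here only to show the naming convention.
fn execution_metric_name(stage: Option<&str>) -> String {
    match stage {
        // A stage-specific latency metric, e.g. the rayon scope.
        Some(stage) => format!("{}_{}_seconds", EXECUTION_PREFIX, stage),
        // The top-level execution latency metric.
        None => format!("{}_seconds", EXECUTION_PREFIX),
    }
}
```

With this helper, `execution_metric_name(None)` yields `aptos_execution_seconds` and `execution_metric_name(Some("rayon"))` yields `aptos_execution_rayon_seconds`, matching the two names suggested above.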

gelash · 2023-03-08T12:45:47Z

aptos-move/block-executor/src/counters.rs

+        // metric name
+        "aptos_execution_get_next_task_seconds",
+        // metric description
+        "The time spent in seconds for getting next task from the scheduler",


Suggest naming the scheduler explicitly in the description: "Block-STM scheduler" (or "Block Executor scheduler", or "Parallel Executor / execution scheduler").
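For context on what a "get next task" latency counter separates out, here is a rough std-only sketch of a worker loop that times the fetch from the scheduler separately from the work done on each task. Everything in it is an assumption made for illustration: `Scheduler` is a plain FIFO queue rather than the real Block-STM scheduler, and `WorkerTimings` accumulates durations instead of observing them into a histogram.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the scheduler: a FIFO queue of transaction indices.
struct Scheduler {
    tasks: VecDeque<usize>,
}

impl Scheduler {
    fn next_task(&mut self) -> Option<usize> {
        self.tasks.pop_front()
    }
}

/// Accumulated time for the two phases of the worker loop.
#[derive(Default, Debug)]
struct WorkerTimings {
    get_next_task: Duration,
    get_next_task_calls: u32,
    run_task: Duration,
    tasks_run: u32,
}

/// Drains the scheduler, timing the fetch and the task body separately.
fn worker_loop(scheduler: &mut Scheduler, mut run: impl FnMut(usize)) -> WorkerTimings {
    let mut timings = WorkerTimings::default();
    loop {
        // Time only the call that asks the scheduler for more work.
        let fetch_start = Instant::now();
        let task = scheduler.next_task();
        timings.get_next_task += fetch_start.elapsed();
        timings.get_next_task_calls += 1;

        let txn_idx = match task {
            Some(idx) => idx,
            None => break, // no more work
        };

        // Time the task body separately so the two costs can be compared.
        let run_start = Instant::now();
        run(txn_idx);
        timings.run_task += run_start.elapsed();
        timings.tasks_run += 1;
    }
    timings
}
```

If the fetch time is large relative to the time spent running tasks, that points at scheduling overhead or contention rather than transaction execution as the bottleneck. Note that the final fetch, which returns no task, is also counted.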

gelash · 2023-03-08T12:47:13Z

aptos-move/block-executor/src/counters.rs

+        // metric name
+        "aptos_execution_work_with_task_seconds",
+        // metric description
+        "The time spent in work task with scope call in Block STM",


Seems like most descriptions say "parallel execution", so this one should match (or we can use "Block STM" or "parallel / block executor" everywhere).

[BlockSTM] Add latency counter for fetching next task

15edfb1

sitalkedia requested review from gelash, zekun000, sasha8 and danielxiangzl as code owners March 6, 2023 22:24

sitalkedia added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Mar 6, 2023

sitalkedia requested a review from grao1991 March 6, 2023 22:26

Fix build

65ce897

sitalkedia requested review from msmouse and lightmark as code owners March 6, 2023 22:27

Fix description

a491021

sitalkedia enabled auto-merge (squash) March 6, 2023 22:31

sitalkedia changed the title ~~[BlockSTM] Add latency counter for fetching next task~~ [BlockSTM] Add latency counter for profiling BlockSTM Mar 7, 2023

danielxiangzl approved these changes Mar 7, 2023

grao1991 reviewed Mar 8, 2023

gelash approved these changes Mar 8, 2023

sitalkedia merged commit 66c0a3e into main Mar 8, 2023

sitalkedia deleted the block_stm_counters_1 branch March 8, 2023 12:47

sitalkedia mentioned this pull request Sep 30, 2023

Remove expensive counters #10188

Merged
